# Visual Question Answering
## SpaceOm GGUF

mgonzs13 · Apache-2.0 · Image-to-Text · English · 196 downloads · 1 like

SpaceOm-GGUF is a multimodal visual question answering model that is particularly strong at spatial reasoning. A loading sketch for GGUF checkpoints follows.

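GGUF checkpoints such as this one target llama.cpp-compatible runtimes rather than plain PyTorch. Below is a minimal loading sketch using llama-cpp-python; the quantization filename pattern is an assumption (check the repo's file list), and true image input additionally requires the model's mmproj projector file with a multimodal chat handler, which is omitted here.

```python
# Minimal sketch: pull a GGUF file from the Hub and run a text-only prompt.
# pip install llama-cpp-python huggingface-hub
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="mgonzs13/SpaceOm-GGUF",   # repo from the entry above
    filename="*Q4_K_M.gguf",           # assumed quant; verify on the model page
    n_ctx=4096,                        # context window to allocate
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Which object is to the left of the chair?"}]
)
print(out["choices"][0]["message"]["content"])
```
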
## Gemma 3 12B It QAT Int4 GGUF

unsloth · Image-to-Text · 1,921 downloads · 3 likes

Gemma 3 is Google's lightweight open model family built on Gemini technology. This 12B instruction-tuned variant is quantized to INT4 using Quantization-Aware Training (QAT), supports multimodal input, and provides a 128K context window.

## My Model

anoushhka · MIT · Image-to-Text · PyTorch · Multilingual · 87 downloads · 0 likes

GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.

## VoRA 7B Instruct

Hon-Wong · Image-to-Text · Transformers · 154 downloads · 12 likes

VoRA is a 7B-parameter vision-language model focused on image-text-to-text tasks.

## Sapnous VR 6B

Sapnous-AI · Apache-2.0 · Image-to-Text · Transformers · English · 261 downloads · 5 likes

Sapnous-6B is a vision-language model whose multimodal capabilities target strong visual perception and understanding.

## Gemma 3 12B It GGUF

ggml-org · Image-to-Text · 8,110 downloads · 23 likes

Gemma 3 is Google's lightweight open multimodal model family, built on the same technology as Gemini; it accepts text and image inputs and generates text outputs.

## Gemma 3 27B It

google · Image-to-Text · Transformers · 371.46k downloads · 1,274 likes

Gemma is Google's lightweight, state-of-the-art open model family, built on the same technology as Gemini, supporting multimodal input and text output.

## SmolVLM2 500M Video Instruct

HuggingFaceTB · Apache-2.0 · Image-to-Text · Transformers · English · 17.89k downloads · 56 likes

A lightweight multimodal model designed for analyzing video content; it processes video, image, and text inputs and generates text outputs.

## SmolVLM2 256M Video Instruct

HuggingFaceTB · Apache-2.0 · Image-to-Text · Transformers · English · 22.16k downloads · 53 likes

SmolVLM2-256M-Video is a lightweight multimodal model built specifically for analyzing video content; it processes video, image, and text inputs and generates text outputs.

## Qwen2.5 VL 7B Instruct Quantized.w8a8

RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English · 1,992 downloads · 3 likes

A quantized version of Qwen2.5-VL-7B-Instruct that takes vision-text input and produces text output; inference efficiency is improved through INT8 quantization of both weights and activations (w8a8).

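Here, "w8a8" means weights and activations are both stored as 8-bit integers. The snippet below is a generic illustration of symmetric per-tensor INT8 quantization, not the exact calibration recipe used for this checkpoint: each tensor is scaled so its largest magnitude maps to 127, rounded, and rescaled at compute time.

```python
# Generic symmetric per-tensor INT8 round-trip; illustrative only.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                      # largest magnitude -> 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                  # approximate reconstruction

w = np.random.randn(4, 1024).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()
print(f"max abs error: {err:.6f} (bounded by ~scale/2 = {scale / 2:.6f})")
```
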
## Qwen2.5 VL 7B Instruct FP8 Dynamic

RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English · 25.18k downloads · 1 like

An FP8-quantized version of Qwen2.5-VL-7B-Instruct (with dynamic activation quantization), intended for efficient vision-text inference through vLLM.

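Since the card recommends vLLM, here is a minimal offline-inference smoke test. The repo id is inferred from the entry title and should be verified; image inputs go through vLLM's multimodal message format, which is omitted to keep the sketch short.

```python
# Minimal vLLM offline inference (text-only smoke test).
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic")  # id inferred from the title
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["In one sentence, what does FP8 quantization trade off?"], params)
print(outputs[0].outputs[0].text)
```
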
## Qwen2.5 VL 3B Instruct FP8 Dynamic

RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English · 112 downloads · 1 like

An FP8-quantized version of Qwen2.5-VL-3B-Instruct that takes visual-text input, produces text output, and is optimized for inference efficiency.

## LlamaV-o1

omkarthawakar · Apache-2.0 · Image-to-Text · Safetensors · English · 1,406 downloads · 93 likes

LlamaV-o1 is a multimodal large language model designed for complex visual reasoning, trained with curriculum learning and performing strongly across diverse benchmarks.

## Microsoft Git Base

seckmaster · MIT · Image-to-Text · Multilingual · 18 downloads · 0 likes

GIT is a Transformer-based generative image-to-text model capable of converting visual content into textual descriptions.

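GIT checkpoints load through the standard transformers image-to-text pipeline. A minimal captioning sketch using the upstream microsoft/git-base checkpoint (the image path is a placeholder):

```python
# Caption a local image with GIT via the transformers pipeline.
# pip install transformers pillow torch
from transformers import pipeline

captioner = pipeline("image-to-text", model="microsoft/git-base")

result = captioner("photo.jpg")          # placeholder path; URLs also work
print(result[0]["generated_text"])       # prints a short caption string
```
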
## PaliGemma2 3B Pt 896

google · Image-to-Text · Transformers · 2,536 downloads · 22 likes

PaliGemma 2 is a multimodal vision-language model that combines image and text inputs to generate text outputs; it supports multiple languages and suits a variety of vision-language tasks.

## Dermatech Qwen2 VL 2B

Rewatiramans · Image-to-Text · Transformers · 60 downloads · 3 likes

A dermatology-specific diagnostic model, LoRA-fine-tuned from Qwen2-VL-2B-Instruct, that analyzes skin-condition images and provides professional diagnostic descriptions. A generic LoRA sketch follows.

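LoRA fine-tuning freezes the base model and trains small low-rank adapter matrices injected into selected layers, which is what makes a 2B-scale medical fine-tune cheap. The peft sketch below shows the mechanics on a small text model; the rank, alpha, and target modules are illustrative, not the settings used for this dermatology model.

```python
# Generic LoRA attachment with peft; hyperparameters are illustrative.
# pip install transformers peft torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in backbone

config = LoraConfig(
    r=16,                        # adapter rank
    lora_alpha=32,               # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused attention projection
)

model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the adapter weights are trainable
```
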
## Florence 2 FT Lung Cancer Detection

nirusanan · Image-to-Text · Transformers · English · 20 downloads · 1 like

A lung cancer detection model fine-tuned from Florence-2-base-ft that identifies lung cancer types from lung images.

## Peacock

UBC-NLP · Other · Image-to-Text · Arabic · 73 downloads · 1 like

Peacock is an Arabic multimodal large language model based on the InstructBLIP architecture, with AraLLaMA as its language model.

## Qwen VL Guidance

RhapsodyAI · Apache-2.0 · Image-to-Text · Transformers · 46 downloads · 2 likes

GUIChat is a visual question answering (VQA) multimodal model that understands image content and answers related questions, optimized specifically for GUI element recognition and interaction.

## Horus OCR

TeeA · Image-to-Text · Transformers · 21 downloads · 0 likes

Donut is a Transformer-based image-to-text model capable of extracting and generating textual content from images.

## PaliGemma 3B Chat V0.2

BUAADreamer · Image-to-Text · Transformers · Multilingual · 80 downloads · 9 likes

A multimodal dialogue model fine-tuned from google/paligemma-3b-mix-448 and optimized for multi-turn conversation.

## PaliGemma Vqav2

merve · Image-to-Text · Transformers · 168 downloads · 13 likes

A fine-tuned version of google/paligemma-3b-pt-224 trained on a subset of the VQAv2 dataset, specializing in visual question answering.

## 360VL 8B

qihoo360 · Apache-2.0 · Image-to-Text · Transformers · Multilingual · 22 downloads · 13 likes

360VL is a multimodal model built on the Llama 3 language model, featuring strong image understanding and bilingual dialogue capabilities.

## LLaVA Llama 3 8B

Intel · Other · Image-to-Text · Transformers · 387 downloads · 14 likes

A large multimodal model trained with the LLaVA-v1.5 framework, using the 8B-parameter Meta-Llama-3-8B-Instruct as its language backbone together with a CLIP-based vision encoder.

## LLaVA NeXT Video 7B DPO

lmms-lab · Video-to-Text · Transformers · 8,049 downloads · 27 likes

LLaVA-NeXT-Video is an open-source multimodal dialogue model, created by fine-tuning a large language model on multimodal instruction-following data; it supports mixed video and text interaction.

## UForm-Gen2-dpo

unum-cloud · Apache-2.0 · Image-to-Text · Transformers · English · 3,568 downloads · 44 likes

UForm-Gen2-dpo is a small generative vision-language model aligned for image captioning and visual question answering through Direct Preference Optimization (DPO) on the VLFeedback and LLaVA-Human-Preference-10K preference datasets. The DPO objective is sketched below.

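For reference, Direct Preference Optimization trains directly on preference pairs without a separate reward model. For a prompt $x$ with preferred response $y_w$ and rejected response $y_l$, the standard objective from the DPO paper (not spelled out on this model card) is

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference.
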
## MoAI 7B

BK-Lee · MIT · Image-to-Text · Transformers · 183 downloads · 45 likes

MoAI is a large-scale language-and-vision hybrid model capable of processing both image and text inputs to generate text outputs.

## LLaVA Maid 7B DPO GGUF

megaaziib · Image-to-Text · 99 downloads · 4 likes

LLaVA is a large language-and-vision assistant model capable of handling multimodal tasks involving images and text.

## Candle LLaVA V1.6 Mistral 7B

DanielClough · Apache-2.0 · Image-to-Text · 73 downloads · 0 likes

LLaVA is a vision-language model capable of understanding and generating text related to images.

## LLaVA V1.5 13B DPO GGUF

antiven0m · Image-to-Text · 30 downloads · 0 likes

LLaVA-v1.5-13B-DPO is a vision-language model based on the LLaVA framework, trained with Direct Preference Optimization (DPO) and converted to the GGUF quantized format for more efficient inference.

## LLaVA V1.6 34B GGUF

cjpais · Apache-2.0 · Image-to-Text · 1,965 downloads · 40 likes

LLaVA 1.6 34B is an open-source multimodal chatbot built by fine-tuning a large language model on multimodal instruction-following data; it supports image-to-text and text-to-text generation.

## LLaVA V1.6 Vicuna 13B

liuhaotian · Image-to-Text · Transformers · 7,080 downloads · 56 likes

LLaVA is an open-source multimodal chatbot, fine-tuned from large language models on multimodal instruction-following data.

## LLaVA V1.6 Mistral 7B

liuhaotian · Apache-2.0 · Image-to-Text · Transformers · 27.45k downloads · 236 likes

LLaVA is an open-source multimodal chatbot, trained by fine-tuning a large language model on multimodal instruction-following data. A transformers inference sketch follows.

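The liuhaotian repos ship weights in the original LLaVA format; plain transformers inference typically goes through an HF-converted checkpoint, assumed here to be llava-hf/llava-v1.6-mistral-7b-hf. A minimal VQA sketch under that assumption:

```python
# Minimal LLaVA-NeXT (v1.6) VQA with transformers; assumes the
# HF-converted checkpoint llava-hf/llava-v1.6-mistral-7b-hf.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # placeholder path
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"  # Mistral-style template

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```
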
## MiniCPM-V

openbmb · Image-to-Text · Transformers · 19.74k downloads · 173 likes

MiniCPM-V is an efficient, lightweight multimodal model optimized for edge-device deployment; it supports bilingual (Chinese-English) interaction and outperforms models of similar scale.

## Moondream1

vikhyatk · Image-to-Text · Transformers · English · 70.48k downloads · 487 likes

A 1.6B-parameter multimodal model that combines a SigLIP vision encoder with the Phi-1.5 language model, supporting image understanding and question-answering tasks.

## Med BLIP 2 QLoRA

NouRed · Image-to-Text · 16 downloads · 1 like

BLIP-2 is a vision-language model based on OPT-2.7B, focused on visual question answering; it understands image content and answers related questions.

## InfiMM Zephyr

Infi-MM · Image-to-Text · Transformers · English · 23 downloads · 10 likes

InfiMM is a multimodal vision-language model inspired by the Flamingo architecture, integrating recent LLMs and suited to a wide range of vision-language tasks.

## UForm-Gen-Chat

unum-cloud · Apache-2.0 · Image-to-Text · Transformers · English · 65 downloads · 19 likes

UForm-Gen-Chat is the multimodal chat fine-tune of UForm-Gen, used primarily for image caption generation and visual question answering.

## UForm-Gen

unum-cloud · Apache-2.0 · Image-to-Text · Transformers · English · 152 downloads · 44 likes

UForm-Gen is a small generative vision-language model used primarily for image caption generation and visual question answering.

## Yi VL 34B

01-ai · Apache-2.0 · Image-to-Text · 150 downloads · 263 likes

Yi-VL-34B is the open-source multimodal model of the Yi series; it understands image content, supports multi-turn conversation, and performs strongly on the MMMU and CMMMU benchmarks.